Exploratory data analysis of a healthcare-fraud dataset from Kaggle is presented.
Section 2 describes how clues in the dataset yield informed guesses about its scope. Terminology and data integrity are also discussed.
The exploratory analysis presented in Sec. 3 has two goals:
A summary of fraud associations is presented in Sec. 4.
The dataset contains training data and test data, each with four csv files.
The training data was used for the analysis. Details of the csv files:
The patients in the dataset appear to be Medicare patients residing in the US and Africa.
While 9.4% of the providers in the training data have the “potential fraud” label, 38% of the claims were submitted by these providers. Even more striking, 58% of the inpatient claims were submitted by “potential fraud” providers.
The hospital visits in the dataset occurred throughout 2009 and were mainly limited to 2009.
Multiple clues in the dataset indicate that the hospital patients are Medicare patients. Medicare covers patients who are 65 or older, certain younger people with disabilities, and people with end-stage renal disease (permanent kidney failure).
The patients appear to be residents of the US and Africa.
The number of healthcare providers in the dataset (including both training and test data) is 6,763. This is somewhat larger than the number of hospitals in the US in 2009 (5,795).
The dataset includes information for
The number of patients in the dataset is
Within the Medicare system, payments for doctor services are distinct from payments to hospitals. The dataset does not appear to include information about payments to doctors.
An outpatient hospital visit is often defined as a visit for which it is not necessary to stay overnight at the hospital.
However, Medicare defines an outpatient as one who has not formally been admitted to a hospital with a doctor’s order. A Medicare publication explains that the decision to formally admit a patient is “a complex medical decision,” and even a stay of two or more nights at a hospital does not guarantee that a patient will be an inpatient.
The question of whether to formally admit a patient can be financially significant for the patient, because inpatient and outpatient visits are covered by different parts of the Medicare program (Part A and Part B, respectively).
The dataset appears clean, and it passed several checks for consistency among the data frames.
There is an apparent inconsistency in some outpatient visits: the two features that report payments for a visit (DeductibleAmtPaid, InscClaimAmtReimbursed) are both equal to zero in some cases.
In most (or possibly all) cases, NA in the csv files can be viewed as an artifact of the way in which the data is presented, rather than an indication that a valid data point is missing.
Columns with NA:
The tables for inpatient and outpatient visits include a number of columns in which codes relevant for the visit can be reported. For example, there are 10 columns ClmDiagnosisCode_1 through ClmDiagnosisCode_10 for claim diagnosis codes, and there are many NA in these columns.
Similarly, three columns are reserved for physicians associated with the visit: AttendingPhysician, OperatingPhysician, OtherPhysician.
For about 2% of the inpatient visits (but not for any outpatient visits), DeductibleAmtPaid is NA. As noted in the section on dataset scope, the deductible paid by inpatients was $1068 in all cases where it was not NA.
Healthcare providers
Hospital visits
Each hospital visit is associated with a claim made by a provider. For purposes of exploratory data analysis, claims made by a provider that has been flagged for potential fraud also receive the “potential fraud” label.
Patients
Doctors
Since the dataset appears to represent a set of visits covered by Medicare, it is expected that the patients are mainly elderly and that many have chronic health problems. However, the frequency of some chronic conditions among these patients is quite high, and most patients have multiple chronic conditions.
Chronic conditions are widespread even among the Medicare patients who are not elderly. In the plot below, the patients under the age of 65 are disabled or have end-stage renal disease (permanent kidney failure).
The chronic conditions are widespread among the Medicare patients in all 50 US states as well as in Africa.
for (claim_type in claim_types) {
plot_number_of_conditions(patients, claim_type)
}plot_chronic_count(patients, 'inpatient')
plot_chronic_count(patients, 'outpatient')for (variable_name in chronic_conditions) {
plot_chronic_percent(patients_concatenated, variable_name)
}for (claim_type in claim_types) {
plot_patient_age(patients, claim_type)
}
for (variable_name in chronic_conditions) {
plot_chronic_by_age(patients, variable_name)
}for (claim_type in claim_types) {
to_plot <- patients[[claim_type]]
x_axis_values <- seq(min(to_plot$claim_count),
max(to_plot$claim_count))
title <- str_c('Number of visits made by ', claim_type, 's')
fig <- to_plot %>%
ggplot(aes(x = claim_count)) +
geom_bar(fill = 'navyblue') +
scale_x_discrete('Number of visits', limits = factor(x_axis_values)) +
ylab('Number of patients') +
ggtitle(title)
print(fig)
}
for (variable_name in chronic_conditions) {
plot_visits_per_patient(patients, variable_name)
}for (variable_name in chronic_conditions) {
plot_chronic_by_location(patients, variable_name)
}The data on hospital visits includes a column ClmAdmitDiagnosisCode that gives the diagnosis at the time the patient was admitted. This code can be roughly identified as the reason for the hospital visit.
The top five reasons for inpatient visits:
The top five reasons for outpatient visits:
For simplicity, providers that have the “potential fraud” label, as well as claims submitted by these providers, will be labeled as “fraudulent” in the remainder of the presentation.
The plots below show that for inpatient claims, both doctors and patients can both be roughly classified into two groups: those with no fraudulent claims and those with only fraudulent claims.
The first plot below also shows, however, a small but still significant fraction of inpatients has 50% fraudulent claims.
Note that the second plot below shows that in the category of inpatients having only a single claim, the median percentage of fraudulent claims is 100%. (About 57% of the inpatients with only a single claim are associated with a fraudulent provider.)
The rough separation of doctors into two groups can also be seen in the scatter plot below. Doctors with a sufficiently large number of claims are exclusively in the “only fraudulent claims” group.
for (claim_type in claim_types) {
plot_fraud_per_individual(
patients,
count_type = 'provider',
individual_type = 'patient',
claim_type = claim_type
)
}for (claim_type in claim_types) {
plot_fraud_per_individual(
patients,
count_type = 'claim',
individual_type = 'patient',
claim_type = claim_type
)
}for (claim_type in claim_types) {
plot_fraud_per_individual(
doctors,
count_type = 'provider',
individual_type = 'doctor',
claim_type = claim_type
)
}Note that for bar charts in which the number of claims per doctor is shown on the horizontal axis, this axis has been truncated, because a few doctors have a very large number of claims. The true range can be seen from the scatter plots of fraudulent claims per doctor vs total claims per doctor.
for (claim_type in claim_types) {
plot_fraud_vs_legit(doctors, claim_type)
plot_fraud_per_individual(
doctors,
count_type = 'claim',
individual_type = 'doctor',
claim_type = claim_type,
count_limit = 30.5
)
}There is a strong association between fraud and the number of claims submitted by a provider.
The plot below illustrates this association for inpatient claims. In this dataset, only fraudulent providers submit more than about 100 inpatient claims.
plot_provider_claim_counts(provider_claim_counts, 'inpatient',
include_months = FALSE)
plot_provider_claim_counts(provider_claim_counts, 'outpatient',
include_months = FALSE)The maximum number of inpatient claims for a provider in 2009 is 502.
The maximum number of outpatient claims for a provider in 2009 is 8208.
Is there trend in the number of monthly claims submitted by providers during the year?
The plots below show that there is a trend, but this trend does not have a significant association with fraud.
plot_provider_claim_counts(provider_claim_counts, 'inpatient',
include_year = FALSE)
plot_provider_claim_counts(provider_claim_counts, 'outpatient',
include_year = FALSE)The cost of hospital visits doesn’t show a strong association with fraud.
For multiple measures of payment per patient, a scatter plot of payment vs the patient’s percentage of fraudulent claims shows a symmetric pattern. This association is potentially significant, although it is difficult to interpret.
for (claim_type in claim_types) {
title <- str_c('Visit cost for ', claim_type, 's')
fig_base <- claims[[claim_type]] %>%
ggplot(aes(x = visit_cost)) +
log_scale_dollar('Visit cost', 'x') +
ggtitle(title)
plot_histograms(fig_base, y_label = 'visits', bins = 20)
}for (claim_type in claim_types) {
to_plot <- claims[[claim_type]] %>%
select(visit_cost, InscClaimAmtReimbursed,
DeductibleAmtPaid, PotentialFraud) %>%
filter(visit_cost != 0) %>%
mutate(percent_covered = InscClaimAmtReimbursed / visit_cost)
title <- str_c('Insurer payments, ', claim_type, ' visits')
plot_payment_distribution(to_plot, InscClaimAmtReimbursed, title)
title <- str_c('Deductible, ', claim_type, ' visits')
plot_payment_distribution(to_plot, DeductibleAmtPaid, title)
title <- str_c('Insurer payment vs deductible, ', claim_type, ' visits')
fig <- to_plot %>%
ggplot(aes(x = DeductibleAmtPaid, y = InscClaimAmtReimbursed)) +
geom_point(aes(color = PotentialFraud)) +
log_scale_dollar('Deductible', 'x') +
log_scale_dollar('Payment by insurer', 'y') +
ggtitle(title)
print(fig)
title <- str_c('Percent covered by insurer, ', claim_type, ' visits')
fig_base <- to_plot %>%
ggplot(aes(percent_covered)) +
scale_x_continuous('Percent covered by insurer',
labels = percent_format()) +
ggtitle(title)
plot_histograms(fig_base, y_label = 'visits', bins = 30)
}for (claim_type in claim_types) {
to_plot <- patients[[claim_type]] %>%
select(all_of(payment_variables), claim_fraud_fraction) %>%
# Patients for which the only visit has visit_cost == 0 show up
# with missing values for the columns in payment_variables.
drop_na()
for (variable_name in payment_variables) {
title <- str_c(payment_labels[variable_name], ', ',
claim_type, 's')
fig <- to_plot %>%
ggplot(aes_string(x = variable_name)) +
geom_point(aes(y = claim_fraud_fraction),
color = 'navyblue') +
log_scale_dollar(axis_label = 'Payment_amount', axis = 'x') +
scale_y_continuous('Patient percentage fraudulent claims',
labels = label_percent()) +
ggtitle(title)
print(fig)
}
}Patient payments do not show a strong dependence on individual chronic conditions that a patient has, possibly because many patients have multiple chronic conditions.
plot_payments_by_chronic(patients)“Inpatient visit duration” is used here to mean the time between an inpatient’s admission and discharge dates.
Plots in the tabbed subsections show that inpatient visit duration is affected by patient age and by the number of chronic conditions a patient has. But there isn’t a strong association between fraud and visit duration.
The 36-day upper bound on visit durations is striking. I didn’t find any indication that 36 days is a significant threshold for inpatient visits covered by Medicare.
title <- 'Distribution of inpatient visit duration'
fig_base <- inpatient_claims %>%
ggplot(aes(x = visit_duration)) +
xlab('Duration (days)') +
ggtitle(title)
plot_bar_charts(fig_base, 'visits')fig <- inpatient_claims %>%
ggplot(aes(x = patient_age, y = visit_duration)) +
geom_point(aes(color = PotentialFraud)) +
scale_x_continuous('Patient age (years)') +
scale_y_continuous('Visit duration (days)') +
ggtitle('Inpatient visit duration vs patient age')
print(fig)to_plot <- inpatient_claims %>%
select(BeneID, visit_duration) %>%
left_join(
select(inpatients, BeneID, all_of(chronic_conditions)),
by = 'BeneID'
) %>%
mutate(across(all_of(chronic_conditions), ~ (.) == 'Y'))
to_plot$number_of_conditions <- factor(rowSums(to_plot[, chronic_conditions]))
fig <- to_plot %>%
ggplot(aes(x = number_of_conditions, y = visit_duration)) +
geom_boxplot(aes(fill = number_of_conditions)) +
xlab('Number of chronic conditions') +
scale_y_continuous('Visit duration (days)') +
guides(fill = 'none') +
ggtitle('Inpatient visit duration vs number of chronic conditions')
print(fig)As discussed in the tabbed subsection Terminology of Sec. 2 above, patients can stay at a hospital for multiple nights without being formally admitted as an inpatient.
The “claim duration”, given by (ClaimEndDt - ClaimStartDt), appears to be the duration of the patient’s stay in the hospital, including any period in which the patient was not formally admitted as an inpatient.
Similarly to visit duration, claim duration does not have a strong association with fraud.
The distribution of visit cost per day is nearly identical for fraudulent and legitimate claims, as illustrated in the plots below.
Note that for both inpatient and outpatient visits, the claims with maximum value of claim duration are all fraudulent. Close inspection of the plots, however, shows that these “all fraud” bars correspond to a very small number of claims.
for (claim_type in claim_types) {
title <- str_c('Distribution of ', claim_type, ' claim duration')
fig_base <- claims[[claim_type]] %>%
ggplot(aes(x = claim_duration)) +
xlab('Duration (days)') +
ggtitle(title)
plot_bar_charts(fig_base, 'visits')
}for (claim_type in claim_types) {
to_plot <- claims[[claim_type]] %>%
filter(visit_cost != 0) %>%
select(PotentialFraud, visit_cost, claim_duration, cost_per_claim_day)
title <- str_c('Claim amount vs claim duration, ', claim_type, ' visits')
fig_base <- to_plot %>%
ggplot(aes(x = claim_duration, y = visit_cost)) +
scale_x_continuous('Claim duration (days)')
plot_cost_vs_duration(fig_base, title)
plot_cost_per_day(to_plot, variable_name = 'cost_per_claim_day',
claim_type = claim_type, duration_label = 'claim')
}It is natural to suspect that fraudulent claims are duplicates of legitimate claims in some sense, particularly given that the distribution of cost per day for fraudulent and legitimate claims is essentially identical.
This section explores the possibility that an important method of duplication is to copy all diagnosis and procedure codes associated with a claim.
The first plot below shows that only a tiny fraction of inpatient claims are duplicated in this way, so further research is needed into methods used to duplicate claims.
Even though only a small number of inpatient claims are duplicated in this way, analysis of identical groups of claims reveals a category containing only fraudulent claims.
plot_duplicate_counts(claims)Consider only the claims for which there is at least one duplicate, and collect these into groups of identical claims. In these groups, how many claims are there per provider?
The second plot below shows that for inpatient claims, the group with two claims per provider is 100% fraud.
to_plot <- filter_out_no_codes(inpatient_claims) %>%
filter(identical_claim_count > 1)
title_per_provider <- 'Number of claims per provider in identical groups, '
fig_base <- to_plot %>%
ggplot(aes(x = as.factor(identical_claims_per_provider))) +
xlab('Number of claims per provider') +
ggtitle(str_c(title_per_provider, 'inpatients'))
plot_bar_charts(fig_base, 'claims')
to_plot <- filter_out_no_codes(outpatient_claims) %>%
filter(identical_claim_count > 1)
fig_base <- to_plot %>%
ggplot(aes(x = identical_claims_per_provider)) +
scale_x_continuous('Number of claims per provider',
breaks = seq(from = 1.0, to = 2.2, by = 0.2)) +
ggtitle(str_c(title_per_provider, 'outpatients'))
plot_histograms(fig_base, y_label = 'claims', bins = 30,
breaks = seq(from = 20e3, to = 100e3, by = 20e3))Plots of weekly visit counts were used to choose the range of dates included in time-series decompositions.
The daily visit counts show a small weekly seasonality. For instance, the most common day for the start of an outpatient visit is Monday, and the least common day is Friday.
From the plot of weekly inpatient visits, it appears that the first and last weekly dates with complete data may be 2008-12-29, 2009-12-21, respectively.
For the outpatient data, the period in which weekly data appears complete is 2009-01-05 to 2009-12-21.
Outside of these ranges, it’s possible that the data is incomplete, and so the time-series decompositions will include only the data in these ranges.
for (claim_type in claim_types) {
plot_weekly_counts(claims, claim_type)
}for (claim_type in claim_types) {
date_range <- valid_date_ranges[[claim_type]]
to_plot <- claims[[claim_type]] %>%
extract_series_data(date_range)
title <- str_c('Daily visit count, ', claim_type, 's')
fig <- plot_series(to_plot, title)
print(fig)
stl_model <- get_stl_model(to_plot)
title <- str_c(
'Decomposition of daily-visit time series, ', claim_type, 's'
)
print(plot(stl_model, main = title))
title <- str_c('Weekly seasonality in visit count, ', claim_type, 's')
fig <- plot_seasonality(to_plot, stl_model, title)
print(fig)
trend_curve <- trend(stl_model)
trend_data <- data.frame(date = to_plot$ClaimStartDt, trend = trend_curve)
title <- str_c('Trend in visit count, ', claim_type, 's')
fig <- trend_data %>%
ggplot(aes(x = date, y = trend)) +
geom_line(color = 'navyblue') +
xlab('Date') +
ylab('Number of visits') +
ggtitle(title)
print(fig)
}Does the weekly seasonality depend on the reason for a hospital visit (as determined by the initial diagnosis code)?
It is tempting to interpret the weekly seasonalities shown below as meaningful, but the variation during the week is quite small, and the number of samples is much smaller than for the full sets of inpatient or outpatient claims. Further analysis would be needed to determine whether these apparent seasonalities are just random noise.
for (claim_type in claim_types) {
date_range <- valid_date_ranges[[claim_type]]
frequent_codes <- freq_admit_codes[[claim_type]] %>%
arrange(desc(count))
claim_data <- claims[[claim_type]] %>%
filter(ClmAdmitDiagnosisCode %in% frequent_codes$code)
for (admit_code in frequent_codes$code) {
description <- code_descriptions[admit_code]
series_data <- claim_data %>%
filter(ClmAdmitDiagnosisCode == admit_code) %>%
extract_series_data(date_range)
title <- str_c('Daily visit count for ', tolower(description), ', ',
claim_type, 's')
fig <- plot_series(series_data, title)
print(fig)
stl_model <- get_stl_model(series_data)
title <- str_c('Weekly seasonality for ', tolower(description), ', ',
claim_type, 's')
fig <- plot_seasonality(series_data, stl_model, title)
print(fig)
}
}The strongest predictor of fraud in the dataset is the number of claims submitted by a provider. The fraudulent providers, which represent a relatively small percentage of the total, are responsible for a substantial percentage of claims submitted.
For inpatient claims, both doctors and patients can both be roughly classified into two groups: those with no fraudulent claims and those with only fraudulent claims.
In addition to this broad generalization, the analysis shows some subtler patterns in fraud associations involving patients.
Duplication of claims is a promising area for further investigation.